Fix NPS measurement for TC scaling #2081

Open: 6 commits to be merged into base branch master

Viren6
Contributor

@Viren6 Viren6 commented Jun 19, 2024

This PR modifies the NPS measurement for TC scaling to more closely resemble actual testing conditions. In particular, it addresses the point raised in #2077.

Currently we run one process with a bench and one process with a search on n-1 threads. This doesn't account for the RAM bandwidth limitations discussed there, so the measured nps is much higher than the real nps.

Instead, this PR runs a bench process for each active core and takes the average NPS.
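
A minimal sketch of that idea, not the PR's actual code. The names `measure_average_nps` and `run_single_bench` are hypothetical; the callable stands in for whatever launches one engine bench process and returns the nps it reports.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable


def measure_average_nps(run_single_bench: Callable[[], float], concurrency: int) -> float:
    """Run `concurrency` benches at the same time and return their mean nps.

    `run_single_bench` is a hypothetical callable that would launch one
    engine bench (for example through subprocess) and return its nps.
    Threads are enough here because each call would mostly wait on a
    child process.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [executor.submit(run_single_bench) for _ in range(concurrency)]
        # result() re-raises any exception raised by a failed bench.
        return mean(future.result() for future in futures)
```

Because all benches run concurrently, they compete for memory bandwidth the same way the engines do during a real test, which is the effect the single-bench measurement misses.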

@Viren6
Contributor Author

Viren6 commented Jun 19, 2024

@vondele
Member

vondele commented Jun 19, 2024

I think that's a reasonable direction. Things to consider (idk the right answer):

  • this will penalise SMP tests, where the actual nps will be higher than the one measured in this way. Could be solved by doing some SMP measurement for SMP tests.
  • I have observed that on very large core workers the 1 second test might actually not be such a good measurement, as the system is spawning engines and only once everything is running the measurement becomes stable.
  • This effectively changes the TC for the progression test, so will have some effect there. Maybe that's something to consider merging shortly after release (i.e. when we usually update the reference nps?).

@ppigazzini ppigazzini added the "enhancement" and "worker update" (code changes requiring a worker update) labels Jan 6, 2025
@ppigazzini
Collaborator

ppigazzini commented Jan 6, 2025

[image: master vs PR]

@ppigazzini
Collaborator

@Disservin the workers look good with the PR. As highlighted by @vondele, merging the PR will change the TC for systems with high concurrency, with a jump in the progression test (PT).

@ppigazzini ppigazzini force-pushed the fix-tc-scaling branch 5 times, most recently from f10794f to 3bed7a9 Compare January 9, 2025 10:14
@ppigazzini
Collaborator

ppigazzini commented Jan 9, 2025

New commit with: simplification with concurrent.futures, proper exception management, bench for SMP use case.
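
One way the SMP bench could be laid out, as a sketch under stated assumptions rather than the commit's actual logic: run `concurrency // threads` bench processes, each using `threads` threads. The helper name `bench_plan` is hypothetical; the argument order follows Stockfish's `bench [ttSize] [threads] [limit] [fenFile] [limitType]` syntax.

```python
from typing import List, Tuple


def bench_plan(engine_path: str, concurrency: int, threads: int = 1,
               hash_mb: int = 16, depth: int = 13) -> Tuple[int, List[str]]:
    """Return (number of parallel processes, command line for each one).

    Assumption: an SMP test with `threads` threads per engine is benched
    with concurrency // threads processes, so the total thread count stays
    close to the worker's concurrency.
    """
    if threads < 1 or concurrency < threads:
        raise ValueError("need 1 <= threads <= concurrency")
    processes = concurrency // threads
    command = [engine_path, "bench", str(hash_mb), str(threads),
               str(depth), "default", "depth"]
    return processes, command
```

For a worker with 48 virtual cores and an 8-thread test, this plan gives 6 processes each running `bench 16 8 13 default depth`.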

MNps, master vs PR, for both a normal test (threads=1) and an SMP test (threads=8). As expected, with master there is no difference for threads > 1.

| code | test | Dual Xeon (bmi2) | Zen 4 (vnni-256) | core i7 (popcnt) |
|---|---|---|---|---|
| concurrency | virtual cores | 48 | 16 | 8 |
| master | 1 thread | 0.21 | 0.36 | 0.25 |
| master | SMP 8 threads | 0.21 | 0.37 | 0.26 |
| PR | 1 thread | 0.13 | 0.15 | 0.18 |
| PR | SMP 8 threads | 0.24 | 0.44 | 0.40 |
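
To get a feel for what these numbers mean for the TC, here is a small sketch. It assumes the worker scales the time control in inverse proportion to the measured nps (scaled TC = TC × reference nps / measured nps); under that assumption the reference nps cancels out and the relative TC change between master and PR is just the ratio of the two measurements. The function name is hypothetical.

```python
# MNps values copied from the table above: (master, PR) per machine.
ONE_THREAD = {
    "Dual Xeon": (0.21, 0.13),
    "Zen 4": (0.36, 0.15),
    "core i7": (0.25, 0.18),
}
SMP_8_THREADS = {
    "Dual Xeon": (0.21, 0.24),
    "Zen 4": (0.37, 0.44),
    "core i7": (0.26, 0.40),
}


def relative_tc_change(master_mnps: float, pr_mnps: float) -> float:
    """Factor by which the scaled TC changes when moving from master to PR.

    Assumes the scaled TC is proportional to 1 / measured nps, so the
    (unknown here) reference nps cancels out of the ratio.
    """
    return master_mnps / pr_mnps


for name, table in (("1 thread", ONE_THREAD), ("SMP 8 threads", SMP_8_THREADS)):
    for machine, (master, pr) in table.items():
        print(f"{name:14s} {machine:10s} TC x {relative_tc_change(master, pr):.2f}")
```

Under this assumption the single-thread TC on the Zen 4 worker would grow by a factor of 0.36 / 0.15 = 2.4, while SMP tests would get a somewhat shorter TC than on master.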

@ppigazzini ppigazzini force-pushed the fix-tc-scaling branch 4 times, most recently from 6f58f79 to 465a270 Compare January 12, 2025 22:02
@ppigazzini
Collaborator

ppigazzini commented Jan 13, 2025

I experimented a bit with stockfish speedtest as a potential replacement for stockfish bench in the worker nps benchmark, but I didn't find a real difference in precision (ratio stdev / average) when setting a comparable time for the 2 benchmarks.

I have these concerns about using speedtest (I didn't follow its development, so I may be missing something):

  • only bench can verify the signature of the engine
  • bench does the same deterministic computation for any worker (with 1 thread)
  • speedtest should be started with a normalized time so that different workers do comparable computations. If the normalized time is computed from the first run of bench (the signature validation run), we still have a bias from bench
  • on a powerful CPU, bench at depth 13 takes less than 1 second. If I'm not wrong, speedtest seems to accept only an integer parameter for the time, so normalizing the time between workers requires a longer computation
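
For the first two points, a minimal sketch of how one bench run can serve both purposes. It assumes the bench summary contains `Nodes searched` and `Nodes/second` lines in the usual Stockfish layout; the function names are hypothetical and the sample numbers are made up for illustration.

```python
import re
from typing import Tuple


def parse_bench_output(text: str) -> Tuple[int, int]:
    """Extract (signature, nps) from the summary printed by a bench run.

    Assumes summary lines of the form 'Nodes searched  : N' and
    'Nodes/second    : N'.
    """
    nodes = re.search(r"Nodes searched\s*:\s*(\d+)", text)
    nps = re.search(r"Nodes/second\s*:\s*(\d+)", text)
    if nodes is None or nps is None:
        raise ValueError("bench summary not found in output")
    return int(nodes.group(1)), int(nps.group(1))


def signature_matches(text: str, expected_signature: int) -> bool:
    """True when the node count of the run equals the expected signature."""
    signature, _ = parse_bench_output(text)
    return signature == expected_signature


# Made-up sample output, for illustration only.
SAMPLE = """
===========================
Total time (ms) : 1500
Nodes searched  : 1234567
Nodes/second    : 823044
"""
```

With a single thread the node count is deterministic, so the same run yields both a signature check and an nps figure.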

Here are my tests:

  • core i7 3770k - bench at depth 13 and speedtest at equivalent time
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 13 20
run   sf_base   sf_test      diff
  1    642062    638495     -3567
  2    637904    645367     +7463
  3    645669    644462     -1207
  4    635846    645669     +9823
  5    635552    641465     +5913
  6    627171    637020     +9849
  7    642361    640868     -1493
  8    636139    638791     +2652
  9    639976    646577     +6601
 10    636139    635260      -879
 11    636432    640273     +3841
 12    642361    643860     +1499
 13    636139    645367     +9228
 14    635260    637020     +1760
 15    637315    648401    +11086
 16    634091    641166     +7075
 17    637904    645367     +7463
 18    636726    645367     +8641
 19    643560    643860      +300
 20    641763    643860     +2097

sf_base =   638018 +/-   1820 (95%)
sf_test =   642425 +/-   1609 (95%)
diff    =     4407 +/-   1951 (95%)
speedup =   0.691% +/- 0.306% (95%)


real    0m54.408s
user    1m45.462s
sys     0m1.335s

average	638018.5	642425.75	4407.25
stdev.s	4154.695562	3673.531713	4451.640122
ratio	0.006511873	0.00571822	0.006953274


$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 3 20
run   sf_base   sf_test      diff
  1    526415    528228     +1813
  2    523249    529589     +6340
  3    521034    525763     +4729
  4    521139    524121     +2982
  5    523099    522704      -395
  6    517411    522574     +5163
  7    518389    524723     +6334
  8    521461    527732     +6271
  9    527087    526548      -539
 10    519574    528365     +8791
 11    521661    529280     +7619
 12    525403    525319       -84
 13    520944    518950     -1994
 14    517173    526080     +8907
 15    516447    518368     +1921
 16    519563    522172     +2609
 17    519312    518803      -509
 18    518095    523615     +5520
 19    519267    524004     +4737
 20    513206    522394     +9188

sf_base =   520496 +/-   1508 (95%)
sf_test =   524466 +/-   1477 (95%)
diff    =     3970 +/-   1534 (95%)
speedup =   0.763% +/- 0.295% (95%)


real    1m12.916s
user    2m23.886s
sys     0m2.049s

average	520496.45	524466.6	3970.15
stdev.s	3441.513796	3370.420345	3500.21296
ratio	0.006611983	0.006426377	0.006699209
  • ryzen 7 4880u - bench at depth 13 and speedtest at equivalent time
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 13 20
run   sf_base   sf_test      diff
  1   1139793   1110426    -29367
  2   1139793   1136974     -2819
  3   1153135   1167781    +14646
  4   1166793   1159924     -6869
  5   1168771   1164822     -3949
  6   1153135   1146425     -6710
  7   1151210   1159924     +8714
  8   1143573   1153135     +9562
  9   1147379   1164822    +17443
 10   1149291   1158949     +9658
 11   1154100   1137912    -16188
 12   1140736   1152172    +11436
 13   1136037   1141680     +5643
 14   1134169   1136974     +2805
 15   1141680   1135102     -6578
 16   1122172   1120349     -1823
 17   1128600   1120349     -8251
 18   1110426   1099800    -10626
 19   1121260   1112217     -9043
 20   1101557   1122172    +20615

sf_base =  1140180 +/-   7505 (95%)
sf_test =  1140095 +/-   8948 (95%)
diff    =      -85 +/-   5437 (95%)
speedup =  -0.007% +/- 0.477% (95%)


real    0m27.261s
user    0m52.480s
sys     0m2.077s

average	1140180.5	1140095.45	-85.05
stdev.s	17125.97215	20418.04072	12406.09801
ratio	0.015020404	0.017909063	0.010881225


$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 1.4 20
run   sf_base   sf_test      diff
  1   1002991    977770    -25221
  2    980706    988994     +8288
  3    980046    976723     -3323
  4   1004739    983160    -21579
  5    982322    989259     +6937
  6    986482    985358     -1124
  7    984034    976997     -7037
  8   1002855    988772    -14083
  9    975933    962120    -13813
 10    992360    965047    -27313
 11    974111    983993     +9882
 12   1004514    992087    -12427
 13    970015    991746    +21731
 14    956093    978705    +22612
 15    965092    982835    +17743
 16    954497    940059    -14438
 17    968743    956396    -12347
 18    956676    955077     -1599
 19    947581    945381     -2200
 20    942819    950398     +7579

sf_base =   976630 +/-   8384 (95%)
sf_test =   973543 +/-   7232 (95%)
diff    =    -3086 +/-   6517 (95%)
speedup =  -0.316% +/- 0.667% (95%)


real    0m23.242s
user    0m40.157s
sys     0m2.577s

average	976630.45	973543.85	-3086.6
stdev.s	19129.792	16503.11905	14869.88728
ratio	0.019587544	0.016951593	0.015249803
  • dual xeon e5-2680v3 - bench at depth 13 and speedtest at equivalent time
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 13 20
run   sf_base   sf_test      diff
  1    582411    611054    +28643
  2    600936    608627     +7691
  3    556108    567551    +11443
  4    573451    595487    +22036
  5    586872    575605    -11267
  6    603566    558586    -44980
  7    587372    579718     -7654
  8    596518    582411    -14107
  9    612411    602775     -9636
 10    583396    590642     +7246
 11    598849    590389     -8460
 12    582165    566851    -15314
 13    589884    586373     -3511
 14    553208    583890    +30682
 15    592927    601986     +9059
 16    578745    564300    -14445
 17    578259    585377     +7118
 18    551219    565921    +14702
 19    578988    596002    +17014
 20    599630    596002     -3628

sf_base =   584345 +/-   7273 (95%)
sf_test =   585477 +/-   6744 (95%)
diff    =     1131 +/-   7880 (95%)
speedup =   0.194% +/- 1.349% (95%)


real    0m58.222s
user    1m51.579s
sys     0m3.298s

average	584345.75	585477.35	1131.6
stdev.s	16597.09802	15389.8147	17981.80839
ratio	0.028402873	0.026285927	0.030742782


$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 3 20
run   sf_base   sf_test      diff
  1    493811    487185     -6626
  2    461326    498548    +37222
  3    483797    483261      -536
  4    477423    444543    -32880
  5    492298    488488     -3810
  6    491763    500013     +8250
  7    480248    471136     -9112
  8    491465    499068     +7603
  9    489841    490662      +821
 10    477191    496783    +19592
 11    486842    496983    +10141
 12    481331    482163      +832
 13    476472    491636    +15164
 14    469090    478852     +9762
 15    483396    491132     +7736
 16    493891    486311     -7580
 17    497079    491734     -5345
 18    493707    493684       -23
 19    484103    493925     +9822
 20    461060    477934    +16874

sf_base =   483306 +/-   4615 (95%)
sf_test =   487202 +/-   5555 (95%)
diff    =     3895 +/-   6174 (95%)
speedup =   0.806% +/- 1.278% (95%)


real    1m12.835s
user    2m19.874s
sys     0m4.604s

average	483306.7	487202.05	3895.35
stdev.s	10531.58584	12675.80741	14088.20323
ratio	0.021790689	0.026017558	0.029032615
  • core i7 3770k - bench at depth 20 and speedtest at equivalent time
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 20 10
run   sf_base   sf_test      diff
  1    814720    820250     +5530
  2    821560    824853     +3293
  3    815993    820013     +4020
  4    812556    820033     +7477
  5    815719    825736    +10017
  6    825214    834174     +8960
  7    802381    805980     +3599
  8    811954    819854     +7900
  9    823653    825154     +1501
 10    825094    829163     +4069

sf_base =   816884 +/-   4453 (95%)
sf_test =   822521 +/-   4624 (95%)
diff    =     5636 +/-   1735 (95%)
speedup =   0.690% +/- 0.212% (95%)


real    7m59.444s
user    15m55.181s
sys     0m0.667s

average	816884.4	822521	5636.6
stdev.s	7185.516531	7460.7933	2800.248687
ratio	0.008796246	0.009070642	0.003416176


$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 48 10
run   sf_base   sf_test      diff
  1    500816    512525    +11709
  2    505838    509329     +3491
  3    498995    504984     +5989
  4    502396    506823     +4427
  5    496036    503498     +7462
  6    496789    504979     +8190
  7    491413    491360       -53
  8    500142    508236     +8094
  9    492447    494211     +1764
 10    497343    496064     -1279

sf_base =   498221 +/-   2723 (95%)
sf_test =   503200 +/-   4340 (95%)
diff    =     4979 +/-   2528 (95%)
speedup =   0.999% +/- 0.508% (95%)


real    9m25.012s
user    18m48.269s
sys     0m1.121s

average	498221.5	503200.9	4979.4
stdev.s	4393.888767	7002.810554	4080.178297
ratio	0.008819147	0.01391653	0.008148766
  • ryzen 7 4880u - bench at depth 20 and speedtest at equivalent time
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 20 10
run   sf_base   sf_test      diff
  1   1420699   1420462      -237
  2   1416137   1449753    +33616
  3   1433292   1435595     +2303
  4   1453538   1466406    +12868
  5   1402336   1447775    +45439
  6   1429611   1438331     +8720
  7   1438026   1435352     -2674
  8   1457781   1471170    +13389
  9   1454472   1437174    -17298
 10   1451860   1433111    -18749

sf_base =  1435775 +/-  11666 (95%)
sf_test =  1443512 +/-   9650 (95%)
diff    =     7737 +/-  12533 (95%)
speedup =   0.539% +/- 0.873% (95%)


real    4m22.872s
user    8m41.477s
sys     0m1.139s

average	1435775.2	1443512.9	7737.7
stdev.s	18822.04371	15570.57138	20221.45002
ratio	0.013109325	0.010786583	0.014046146


$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 27 10
run   sf_base   sf_test      diff
  1    961782    968902     +7120
  2    969845    960509     -9336
  3    921825    951732    +29907
  4    948890    948322      -568
  5    952444    956790     +4346
  6    951687    939523    -12164
  7    954338    970641    +16303
  8    869486    879096     +9610
  9    947384    940568     -6816
 10    955128    954831      -297

sf_base =   943280 +/-  17796 (95%)
sf_test =   947091 +/-  16150 (95%)
diff    =     3810 +/-   7891 (95%)
speedup =   0.404% +/- 0.837% (95%)


real    5m5.283s
user    10m8.288s
sys     0m1.521s

average	943280.9	947091.4	3810.5
stdev.s	28712.21306	26056.65103	12732.04205
ratio	0.030438667	0.027512288	0.013470407
  • dual xeon e5-2680v3 - bench at depth 20 and speedtest at equivalent time
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 20 10
run   sf_base   sf_test      diff
  1    768409    750526    -17883
  2    753189    757913     +4724
  3    759048    748640    -10408
  4    738689    731056     -7633
  5    738770    752722    +13952
  6    751223    746582     -4641
  7    748623    748772      +149
  8    753908    756192     +2284
  9    747733    752005     +4272
 10    741253    735840     -5413

sf_base =   750084 +/-   5789 (95%)
sf_test =   748024 +/-   5259 (95%)
diff    =    -2059 +/-   5602 (95%)
speedup =  -0.275% +/- 0.747% (95%)


real    8m28.918s
user    16m50.191s
sys     0m2.022s

average	750084.5	748024.8	-2059.7
stdev.s	9340.599222	8485.735165	9038.620667
ratio	0.012452729	0.01134419	0.012066704


$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 50 10
run   sf_base   sf_test      diff
  1    482089    479438     -2651
  2    475337    463893    -11444
  3    455283    466308    +11025
  4    475370    463149    -12221
  5    479778    470888     -8890
  6    465646    473220     +7574
  7    473976    474571      +595
  8    468684    464385     -4299
  9    465412    468351     +2939
 10    461725    456763     -4962

sf_base =   470330 +/-   5229 (95%)
sf_test =   468096 +/-   4091 (95%)
diff    =    -2233 +/-   4834 (95%)
speedup =  -0.475% +/- 1.028% (95%)


real    9m30.372s
user    18m57.413s
sys     0m2.675s

average	470330	468096.6	-2233.4
stdev.s	8436.678388	6601.562293	7799.628299
ratio	0.017937785	0.014102991	0.016622778

With speedtest the nps ratio between CPUs is slightly higher than with bench. For both bench and speedtest, the ratio is fairly steady from depth 13 to depth 20.

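| benchmark | i7 3770k (nps) | ryzen7 4880u (nps) | dual xeon e5-2680v3 (nps) | i7 3770k (ratio) | ryzen7 4880u (ratio) | dual xeon e5-2680v3 (ratio) |
|---|---|---|---|---|---|---|
| bench 13 | 638018 | 1140180 | 584345 | 1 | 1.787065569 | 0.915875414 |
| speedtest | 520496 | 976630 | 483306 | 1 | 1.876344871 | 0.928548923 |
| bench 20 | 816884 | 1435775 | 750084 | 1 | 1.757624094 | 0.918225844 |
| speedtest | 498221 | 943280 | 470330 | 1 | 1.893296348 | 0.944018819 |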
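
The `average`, `stdev.s` and `ratio` rows in the test output above can be reproduced with a few lines. This is a sketch, not the spreadsheet or script actually used; `summarize` is a hypothetical helper, and the 95% half-width assumes a normal approximation (1.96 × stdev / √n), which is close to the `+/-` values printed by the scripts.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Sequence

# sf_base column of the first run (core i7 3770k, bench at depth 13).
SF_BASE = [
    642062, 637904, 645669, 635846, 635552, 627171, 642361, 636139,
    639976, 636139, 636432, 642361, 636139, 635260, 637315, 634091,
    637904, 636726, 643560, 641763,
]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Return average, sample stdev, stdev/average ratio and 95% half-width."""
    average = mean(samples)
    sample_stdev = stdev(samples)  # n - 1 in the denominator, like stdev.s
    return {
        "average": average,
        "stdev": sample_stdev,
        "ratio": sample_stdev / average,
        # Normal approximation; an assumption about how +/- was computed.
        "ci95": 1.96 * sample_stdev / sqrt(len(samples)),
    }


print(summarize(SF_BASE))
```

The `ratio` value is the precision figure used to compare bench and speedtest: a lower ratio means less run-to-run noise relative to the measured nps.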

@ppigazzini
Collaborator

ppigazzini commented Jan 13, 2025

  • this will penalise SMP tests, where the actual nps will be higher than the one measured in this way. Could be solved by doing some SMP measurement for SMP tests.

Done.

  • I have observed that on very large core workers the 1 second test might actually not be such a good measurement, as the system is spawning engines and only once everything is running the measurement becomes stable.

We can run bench a second time with a depth > 13; the code change is easy. We should compute the reference values with the new depth, though.

  • This effectively changes the TC for the progression test, so will have some effect there. Maybe that's something to consider merging shortly after release (i.e. when we usually update the reference nps?).

To be discussed.

@ppigazzini ppigazzini force-pushed the fix-tc-scaling branch 2 times, most recently from 42fb071 to d63fcfa Compare January 14, 2025 19:20